Abstract

Kernel approximation via nonlinear random feature maps is widely used to speed up kernel machines. Conventional
kernel approximation methods face two main challenges. First, a good kernel has to be chosen before performing
kernel approximation, and picking a good kernel is a very challenging problem in itself. Second, high-dimensional
maps are often required in order to achieve good performance. This leads to high computational cost both in
generating the nonlinear maps and in the subsequent learning and prediction process. In this work, we propose to
optimize the nonlinear maps directly with respect to the classification objective in a data-dependent fashion.
The proposed approach achieves kernel approximation and kernel learning in a joint framework. This leads to
much more compact maps without hurting the performance. As a by-product, the same framework can also be used
to achieve more compact kernel maps to approximate a known kernel. We also introduce Circulant Nonlinear Maps,
which use a circulant-structured projection matrix to speed up the nonlinear maps for high-dimensional data.


1 Introduction 

Kernel methods such as Support Vector Machines (SVMs) are widely used in machine learning to provide
nonlinear decision functions. Kernel methods use a positive-definite kernel function $K$ to induce an implicit
nonlinear map $\phi$ such that $K(x, y) = \langle \phi(x), \phi(y) \rangle$, $x, y \in \mathbb{R}^d$. This implicit
feature space could potentially be infinite-dimensional. Fortunately, kernel methods allow one to utilize the power
of these rich feature spaces without explicitly working in such high dimensions. Despite their popularity, kernel
machines come with a high computational cost because training requires computing a large kernel matrix of size
$N \times N$, where $N$ is the number of training points. Hence the overall training complexity varies from
$O(N^2)$ to $O(N^3)$, which is prohibitive when training with millions of samples. Testing also tends to be slow
due to the linear growth in the number of support vectors with training data, leading to $O(Nd)$ complexity for
$d$-dimensional vectors.

On the other hand, linear SVMs are appealing for large-scale applications since they can be trained in $O(N)$ time
[25, 16, 41] and applied in $O(d)$ time, independent of $N$. Hence, if the input data can be mapped nonlinearly into a
compact feature space explicitly, one can utilize fast training and testing of linear methods while still preserving the
expressive power of kernel methods.

Following this reasoning, kernel approximation via explicit nonlinear maps has become a popular strategy for
speeding up kernel machines [40]. Formally, given a kernel $K(x, y)$, kernel approximation aims at finding a nonlinear
map $Z(\cdot)$ such that

$K(x, y) \approx Z(x)^T Z(y).$

However, there are two main issues with the existing nonlinear mapping methods. First, a "good" kernel has to be
chosen before the kernel approximation. Choosing a good kernel is perhaps an even more challenging problem than
approximating a known kernel. Second, the existing methods are designed to approximate the kernel in the whole
space, independent of the data. As a result, the feature mapping often needs to be high-dimensional in order to achieve
low kernel approximation error.

In this work, we propose an alternative formulation that optimizes the nonlinear maps directly in a data-dependent
fashion. Specifically, we adopt the Random Fourier Feature framework [40] for approximating positive definite
shift-invariant kernels. Instead of generating the parameters of the nonlinear map randomly from a distribution, we learn
the parameters by minimizing the classification loss on the training data (Section 4). The proposed method can
be seen as approximating an "optimal kernel" for the classification task. The method results in significantly more
compact maps with very competitive classification performance. As a by-product, the same framework can also be
used to achieve compact kernel approximation, if the goal is to approximate some predefined kernel (Section 5).
The proposed compact nonlinear maps are fast to learn, and compare favorably to the baselines. In addition, to make
the method scalable for very high-dimensional data, we propose to use circulant structured projection matrices in the
nonlinear maps (Section 8). This further improves the computational complexity from $O(kd)$ to $O(k \log d)$ and the
space complexity from $O(kd)$ to $O(k)$, where $k$ is the number of nonlinear maps, and $d$ is the input dimensionality.


2 Related Works 

Kernel Approximation. Following the seminal work on explicit nonlinear feature maps for approximating positive
definite shift-invariant kernels [40], nonlinear mapping techniques have been proposed to approximate other forms of
kernels such as the polynomial kernel [27, 39], generalized RBF kernels [42], intersection kernels [34], additive kernels
[43], skewed multiplicative histogram kernels [33], and semigroup kernels [47]. Techniques have also been proposed
to improve the speed and compactness of kernel approximations by using structured projections [32], better quasi
Monte Carlo sampling [46], binary codes [50, 35], and dimensionality reduction. Our method in this paper is built
upon the Random Fourier Feature [40] for approximating shift-invariant kernels, a widely used kernel type in machine
learning. Besides explicit nonlinear maps, kernel approximation can also be achieved using sampling-based low-rank
approximations of the kernel matrices such as the Nyström method [45, 15, 30]. In order for these approximations to
work well, the eigenspectrum of the kernel matrix should have a large gap [48].

Kernel Learning. There have been significant efforts in learning a good kernel for kernel machines. Works
have been proposed to optimize the hyperparameters of a kernel function [10, 29], and to find the best way of
combining multiple kernels, i.e., Multiple Kernel Learning (MKL) [4, 3, 18, 12]. A summary of MKL can be found in
[20]. Related to our work, [6, 19] propose to optimize shift-invariant kernels. Different from the above, the proposed
approach can be seen as learning an optimal kernel by directly optimizing its nonlinear maps. Therefore, it is a joint
kernel approximation and kernel learning method.

Fast Nonlinear Models. Besides kernel approximation, there have been other types of works aiming at speeding
up kernel machines [8]. Such techniques include decomposition methods [37, 9], sparsifying kernels, limiting the
number of support vectors [28, 38], and low-rank approximations. None of the above methods can be scaled
to truly large-scale data. Another alternative is to consider the local structure of the data to train and apply the kernel
machines locally [23, 26, 24]. However, partitioning becomes unreliable in high-dimensional data. Our work is
also related to shallow neural networks, as we will discuss in a later part of this paper.

3 Random Fourier Features: A Review 

We begin by reviewing the Random Fourier Feature method [40], which is widely used in approximating
positive-definite shift-invariant kernels. A kernel $K$ is shift-invariant if $K(x, y) = K(z)$ where $z = x - y$. For a
function $K(z)$ which is positive definite on $\mathbb{R}^d$, it is guaranteed that the Fourier transform of $K(z)$,

$\hat{K}(\theta) = \frac{1}{(2\pi)^{d/2}} \int K(z)\, e^{-i \theta^T z}\, dz,$ (1)

admits an interpretation as a probability distribution. This fact follows from Bochner's celebrated characterization of
positive definite functions.

Theorem 1. A function $K \in C(\mathbb{R}^d)$ is positive definite on $\mathbb{R}^d$ if and only if it is the Fourier
transform of a finite non-negative Borel measure on $\mathbb{R}^d$.

A consequence of Bochner's theorem is that the inverse Fourier transform of $\hat{K}(\theta)$, i.e., $K(z)$, can be
interpreted as the computation of an expectation, i.e.,

$K(x - y) = 2\, \mathbb{E}_{\theta \sim p(\theta),\; b \sim U(0, 2\pi)} \left[ \cos(\theta^T x + b) \cos(\theta^T y + b) \right],$ (2)

where $p(\theta) = (2\pi)^{-d/2} \hat{K}(\theta)$ and $U(0, 2\pi)$ is the uniform distribution on $[0, 2\pi)$. If the above
expectation is approximated using Monte Carlo with $k$ random samples $\{\theta_i, b_i\}_{i=1}^{k}$, then
$K(x, y) \approx \langle Z(x), Z(y) \rangle$ with

$Z(x) = \sqrt{2/k}\, [\cos(\theta_1^T x + b_1), \ldots, \cos(\theta_k^T x + b_k)]^T.$ (3)
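
As an illustration of Equation (3), the following is a minimal NumPy sketch of the random feature map for the Gaussian kernel. It assumes the parameterization $K(z) = \exp(-\|z\|^2 / (2\sigma^2)$), for which $p(\theta)$ is the Gaussian $N(0, I/\sigma^2)$; the function name `rff_map`, the synthetic data, and all sizes are hypothetical choices for this sketch and are not taken from the experiments.

```python
import numpy as np

def rff_map(X, thetas, biases):
    """Map each row x of X to Z(x) = sqrt(2/k) * cos(theta_i^T x + b_i), i = 1..k."""
    k = thetas.shape[1]
    return np.sqrt(2.0 / k) * np.cos(X @ thetas + biases)

rng = np.random.default_rng(0)
d, k, n, sigma = 10, 4096, 50, 2.0          # illustrative sizes (assumptions)

X = rng.normal(size=(n, d))                 # synthetic data, not from the paper
# Assumed kernel: K(z) = exp(-||z||^2 / (2 sigma^2)), so theta ~ N(0, I / sigma^2).
thetas = rng.normal(scale=1.0 / sigma, size=(d, k))
biases = rng.uniform(0.0, 2.0 * np.pi, size=k)

Z = rff_map(X, thetas, biases)
K_approx = Z @ Z.T                          # Monte Carlo estimate of the kernel matrix

sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / (2.0 * sigma ** 2))
max_abs_err = np.max(np.abs(K_approx - K_exact))
```

The approximation error shrinks roughly as $1/\sqrt{k}$, which is why a data-independent map of this kind typically needs a large $k$.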

Such Random Fourier Features have been used to approximate different types of positive definite shift-invariant
kernels, including the Gaussian kernel, the Laplacian kernel, and the Cauchy kernel [40]. Despite the popularity and
success of Random Fourier Features, the notable issues for all kernel approximation methods are that:

• Before performing the kernel approximation, a known kernel has to be chosen. This is a very challenging task. 
As a matter of fact, the classification performance is influenced by both the quality of the kernel, and the error in 
approximating it. Therefore, better kernel approximation in itself may not lead to better classification performance. 

• The Monte-Carlo sampling technique tries to approximate the kernel for any pair of points in the entire input space 
without considering the data distribution. This usually leads to very high-dimensional maps in order to achieve low 
kernel approximation error everywhere. 

In this work, we follow the Random Fourier Feature framework. Instead of sampling the kernel approximation
parameters $\theta_i$ and $b_i$ from a probability distribution to approximate a known kernel, we propose to optimize them
directly with respect to the classification objective. This leads to very compact maps as well as higher classification
accuracy.


4 The Compact Nonlinear Map (CNM) 

4.1 The Framework 

Consider the following feature maps, and the resulting kernel, based on the Random Fourier Features proposed
in [40]:

$K_\Theta(x, y) = Z(x)^T Z(y), \quad Z_i(x) = \sqrt{2/k}\, \cos(\theta_i^T x), \quad i = 1, \ldots, k.$ (4)

By representing $\Theta = [\theta_1, \ldots, \theta_k]$, we can write $Z(x) = \cos(\Theta^T x)$ up to the constant factor
$\sqrt{2/k}$, where $\cos(\cdot)$ is the element-wise cosine function.

1 For simplicity, we do not consider the bias term, which can be added implicitly by augmenting the feature $x$ with an extra dimension.
Algorithm 1 Optimizing $w$ with fixed $\Theta$

1: INPUT: initialized $w$, $\|w\| \le 1/\sqrt{\lambda}$.
2: OUTPUT: updated $w$.
3: for $t = 1$ to $T_1$ do
4:     Sample $M$ points to get $A$, and compute the gradient $\nabla_w$.
5:     $w \leftarrow w - \frac{1}{\lambda t} \nabla_w$.
6:     $w \leftarrow \min\{1, \frac{1/\sqrt{\lambda}}{\|w\|}\}\, w$.
7: end for


Proposition 1. For any $\Theta$, the kernel function $K_\Theta$, defined as $K_\Theta(x, y) = Z(x)^T Z(y)$, is a
positive-definite shift-invariant kernel.

Proof. The shift-invariance follows from the fact that, for any $x, y \in \mathbb{R}$ and a bias $b \sim U(0, 2\pi)$ as in
Equation (2),

$2\, \mathbb{E}_b[\cos(x + b) \cos(y + b)] = \cos(x - y),$ a function of $x - y$.

The positive definiteness follows from a direct computation and the definition. □
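
For reference, the product-to-sum step behind the shift-invariance of bias-averaged cosine features can be written out as follows. This is a short worked derivation, assuming $b \sim U(0, 2\pi)$ as in Equation (2) and using the shorthand $u = \theta^T x$, $v = \theta^T y$, which is introduced here only for illustration.

```latex
\begin{align*}
2\cos(u + b)\cos(v + b) &= \cos(u - v) + \cos(u + v + 2b), \\
\mathbb{E}_{b \sim U(0, 2\pi)}\left[\cos(u + v + 2b)\right]
  &= \frac{1}{2\pi}\int_{0}^{2\pi} \cos(u + v + 2b)\, db = 0, \\
2\,\mathbb{E}_{b \sim U(0, 2\pi)}\left[\cos(u + b)\cos(v + b)\right] &= \cos(u - v).
\end{align*}
```

The second term depends on $u + v$ and vanishes only after averaging over the bias; what remains depends on $x$ and $y$ only through $\theta^T (x - y)$.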

In addition, by Bochner's theorem, such a cosine map can be used to approximate any positive definite
shift-invariant kernel. Therefore, optimizing the "kernel approximation" parameters directly can be
seen as approximating an optimal positive definite shift-invariant kernel for the task. In this work we consider the
task of binary classification using SVM. The proposed approach can be easily extended to other scenarios such as
multi-class classification and regression.

Suppose we have $N$ samples with +1/-1 labels as training data $(x_1, y_1), \ldots, (x_N, y_N)$. The Compact Nonlinear
Maps (CNM) jointly optimize the nonlinear map parameters $\Theta$ and the linear classifier $w$ in a data-dependent fashion:

$\arg\min_{w, \Theta}\; \frac{\lambda}{2} w^T w + \frac{1}{N} \sum_{i=1}^{N} L(y_i, w^T Z(x_i)).$ (5)

In this paper, we use the hinge loss as the loss function: $L(y_i, w^T Z(x_i)) = \max(0, 1 - y_i w^T Z(x_i))$.


4.2 The Alternating Minimization 

Optimizing Equation (5) is a challenging task. A large number of parameters need to be optimized, and the problem
is nonconvex. In this work, we propose to find a local solution of the optimization problem with Stochastic Gradient
Descent (SGD) in an alternating fashion.

For a fixed $\Theta$, the optimization of $w$ is simply the traditional linear SVM learning problem:

$\arg\min_{w}\; \frac{\lambda}{2} w^T w + \frac{1}{N} \sum_{i=1}^{N} L(y_i, w^T Z(x_i)).$ (6)


We use the Pegasos procedure [41] to perform SGD. In each step, we sample a small set of data points $A$. The
data points with non-zero loss are denoted as $A^+$. Therefore, the gradient can be written as

$\nabla_w = \lambda w - \frac{1}{|A|} \sum_{(x, y) \in A^+} y \cos(\Theta^T x).$ (7)

Each step of the Pegasos procedure consists of a gradient descent step and a projection step. The process is summarized
in Algorithm 1.
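
The update of $w$ for a fixed $\Theta$ can be sketched in NumPy as below. This is a minimal illustration of the mini-batch gradient in Equation (7) followed by the projection onto the ball of radius $1/\sqrt{\lambda}$; the function names, the synthetic data, and the values of `lam`, `num_steps`, and `batch_size` are assumptions made for this sketch only.

```python
import numpy as np

def feature_map(X, Theta):
    """Z(x) = cos(Theta^T x) for each row x of X (bias omitted, as in the text)."""
    return np.cos(X @ Theta)

def pegasos_w_steps(w, Theta, X, y, lam, num_steps, batch_size, rng):
    """Run mini-batch Pegasos-style updates on w while Theta is held fixed."""
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, num_steps + 1):
        idx = rng.choice(X.shape[0], size=batch_size, replace=False)
        Z, yb = feature_map(X[idx], Theta), y[idx]
        active = yb * (Z @ w) < 1.0                  # A+: samples with non-zero hinge loss
        grad = lam * w - (yb[active][:, None] * Z[active]).sum(axis=0) / batch_size
        w = w - grad / (lam * t)                     # step size 1 / (lambda * t)
        w = min(1.0, radius / max(np.linalg.norm(w), 1e-12)) * w   # projection step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                        # synthetic data, not from the paper
y = np.where(X[:, 0] * X[:, 1] > 0, 1.0, -1.0)       # synthetic +1/-1 labels
Theta = rng.normal(size=(5, 16))
lam = 0.01
w = pegasos_w_steps(np.zeros(16), Theta, X, y, lam,
                    num_steps=100, batch_size=32, rng=rng)
```

The projection keeps $\|w\| \le 1/\sqrt{\lambda}$ after every step, matching the input condition of Algorithm 1.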
Algorithm 2 Optimizing $\Theta$ with fixed $w$

1: INPUT: initialized $\Theta$.
2: OUTPUT: updated $\Theta$.
3: for $t = 1$ to $T_2$ do
4:     Sample $M$ points to get $A$, and compute the gradient $\nabla_\Theta$.
5:     $\Theta \leftarrow \Theta - \frac{1}{\lambda t} \nabla_\Theta$.
6: end for


Algorithm 3 The Compact Nonlinear Map (CNM)

1: Initialize $\Theta$ as the Random Fourier Feature.
2: Choose $w$ such that $\|w\| \le 1/\sqrt{\lambda}$.
3: for iter $= 1$ to $T$ do
4:     Perform $T_1$ SGD (Pegasos [41]) steps to optimize $w$, as shown in Algorithm 1.
5:     Perform $T_2$ SGD steps to optimize $\Theta$, as shown in Algorithm 2.
6: end for


For a fixed $w$, optimizing $\Theta$ becomes

$\arg\min_{\Theta}\; \frac{1}{N} \sum_{i=1}^{N} L(y_i, w^T Z(x_i)).$ (8)


We also perform SGD with sampled mini-batches. Let the set of sampled data points be $A$. The gradient can be
written as

$\nabla_{\theta_i} = \frac{w_i}{|A|} \sum_{(x, y) \in A^+} y \sin(\theta_i^T x)\, x,$ (9)

where $A^+$ is the set of samples with non-zero loss, and $w_i$ is the $i$-th element of $w$.

The overall algorithm is shown in Algorithm 3. The sampled gradient descent steps are repeated to optimize $w$
and $\Theta$ alternately. We use a $\Theta$ obtained from sampling the Gaussian distribution (same as Random Fourier Feature)
as initialization.
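
The alternating procedure can be sketched as follows. This is a minimal NumPy illustration under several assumptions: the function names, default hyperparameters, and the decaying step size `theta_lr / t` for $\Theta$ are hypothetical choices for the sketch (Algorithm 2 states the step as $1/(\lambda t)$), and the feature map omits the bias and the constant $\sqrt{2/k}$.

```python
import numpy as np

def w_step(w, Theta, Xb, yb, lam, t):
    """One Pegasos-style step on w (gradient of Equation (7), then projection)."""
    Z = np.cos(Xb @ Theta)
    active = yb * (Z @ w) < 1.0                       # A+: non-zero hinge loss
    grad = lam * w - (yb[active][:, None] * Z[active]).sum(axis=0) / len(yb)
    w = w - grad / (lam * t)
    return min(1.0, 1.0 / (np.sqrt(lam) * max(np.linalg.norm(w), 1e-12))) * w

def theta_step(w, Theta, Xb, yb, step):
    """One SGD step on Theta using the gradient of Equation (9)."""
    P = Xb @ Theta                                    # (M, k) projections theta_i^T x
    active = yb * (np.cos(P) @ w) < 1.0
    # Entry (m, i): y * sin(theta_i^T x) * w_i for each active sample x.
    coeff = yb[active][:, None] * np.sin(P[active]) * w[None, :]
    grad = Xb[active].T @ coeff / len(yb)             # (d, k); column i is grad of theta_i
    return Theta - step * grad

def train_cnm(X, y, k, lam=1e-3, outer=20, T1=100, T2=100, M=64,
              theta_lr=0.1, seed=0):
    """Alternate between w and Theta updates, starting from a Gaussian Theta."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Theta = rng.normal(size=(d, k))                   # Random Fourier Feature style init
    w = np.zeros(k)
    for _ in range(outer):
        for t in range(1, T1 + 1):
            idx = rng.choice(n, size=M, replace=False)
            w = w_step(w, Theta, X[idx], y[idx], lam, t)
        for t in range(1, T2 + 1):
            idx = rng.choice(n, size=M, replace=False)
            # Assumed step-size schedule for this sketch, not the paper's setting.
            Theta = theta_step(w, Theta, X[idx], y[idx], theta_lr / t)
    return w, Theta
```

A call such as `w, Theta = train_cnm(X, y, k=16)` returns the linear classifier and the learned projection; predictions are then `np.sign(np.cos(X @ Theta) @ w)`.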


5 CNM for Kernel Approximation 

In the previous section, we presented the Compact Nonlinear Maps (CNM) optimized to achieve low classification
error. This framework can also be used to achieve compact kernel approximation. The idea is to optimize with respect
to the kernel approximation error. For example, given a kernel function $K$, we can optimize $\Theta$ to minimize the MSE on
the training data:

$\arg\min_{\Theta}\; \sum_{i=1}^{N} \sum_{j=1}^{N} \left( K(x_i, x_j) - Z(x_i)^T Z(x_j) \right)^2.$ (10)

This can be used to achieve a more compact kernel approximation by taking the training data into account. Note
that the ultimate goal of a nonlinear map is to improve the classification performance. Therefore, this section should
be viewed as a by-product of the proposed method.

For the optimization, we can also perform SGD similar to the previous section. Let $A$ be the set of random samples.
We only need to compute the gradient in terms of $\Theta$:

$\nabla_{\theta_i} \propto \sum_{x, x' \in A} \left( K(x, x') - \cos(\Theta^T x)^T \cos(\Theta^T x') \right) \sin(\theta_i^T x) \cos(\theta_i^T x')\, x.$ (11)

Table 1: 8 UCI datasets used in the experiments

Dataset    Number of Training    Number of Testing    Dimensionality
USPS       7,291                 2,007                256
BANANA     1,000                 4,300                2
MNIST      60,000                10,000               784
CIFAR      50,000                10,000               400
FOREST     522,910               58,102               54
LETTER     12,000                6,000                16
MAGIC04    14,226                4,795                10
IJCNN      49,990                91,701               22
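
The gradient in Equation (11) can be sketched in vectorized form as below. This is a minimal NumPy illustration, assuming $Z(x) = \cos(\Theta^T x)$ without the $\sqrt{2/k}$ scaling and a symmetric kernel matrix `K` on the sampled points; the function name `kernel_mse_and_grad` is hypothetical, and the constant factor 4 in the code comes from differentiating the sum over ordered pairs in Equation (10) within this sketch.

```python
import numpy as np

def kernel_mse_and_grad(Theta, X, K):
    """Squared kernel approximation error on the rows of X and its gradient.

    loss = sum_{x, x'} (K(x, x') - cos(Theta^T x)^T cos(Theta^T x'))^2
    """
    P = X @ Theta                      # (n, k) projections theta_i^T x
    C, S = np.cos(P), np.sin(P)
    E = K - C @ C.T                    # (n, n) residuals, symmetric when K is
    loss = np.sum(E ** 2)
    # Column i: 4 * sum_{x, x'} E(x, x') * sin(theta_i^T x) * cos(theta_i^T x') * x
    grad = 4.0 * X.T @ (S * (E @ C))   # (d, k)
    return loss, grad
```

A plain gradient step is then `Theta = Theta - step * grad`, with `step` a small positive value chosen by the user.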

6 Discussions 

We presented Compact Nonlinear Maps (CNM) with an alternating optimization algorithm for the task of binary
classification. CNM can be easily adapted to other tasks such as regression and multi-class classification. The only
difference is that the gradient computation of the algorithm needs to be changed. We provide below a brief discussion
regarding adding regularization, and the relationship of CNM to neural networks.

6.1 Regularization 

One interesting fact is that the cos function has an infinite VC dimension. In the proposed method, with a fixed
$w$, if we only optimize $\Theta$ with SGD, the magnitude of $\Theta$ will grow unbounded. This leads to near-perfect
training accuracy and, consequently, overfitting. Therefore, a regularizer over $\Theta$ should lead to better performance. We
have tested different types of regularization of $\Theta$ such as the Frobenius norm and the $\ell_1$ norm. Interestingly, such
regularization only marginally improved the performance. It appears that early stopping in the alternating
minimization framework provides reasonable regularization in practice on the tested datasets.

6.2 CNM as Neural Networks 

One can view the proposed CNM framework from a different angle. If we ignore the original motivation of the
work, i.e., kernel approximation via Random Fourier Features, the proposed method can be seen as a shallow neural
network with one hidden layer, with $\cos(\cdot)$ as the activation function, and the SVM objective. It is interesting to note
that such a "two-layer neural network", which simulates certain shift-invariant kernels, leads to very good classification
performance as shown in the experimental section. Under the neural network view, one can also use back-propagation
as the optimization method, similar to the proposed alternating SGD, or use other types of activation functions such as
the sigmoid and ReLU functions. However, the "network" will then no longer correspond to a shift-invariant kernel.
[Figure 1 panels: (a) MAGIC04, (b) MNIST, (c) USPS, (d) BANANA, (e) FOREST, (f) CIFAR, (g) LETTER, (h) IJCNN]

Figure 1: Compact Nonlinear Map (CNM) for classification. RFFM: Random Fourier Feature Map based on RBF
kernel. CNM-kerapp: CNM for kernel approximation (Section 5). CNM-classification: CNM for classification
(Section 4). RBF: RBF kernel SVM. Linear: linear SVM based on the original feature.


7 Experiments 

We conduct experiments using 8 UCI datasets summarized in Table 1. The size of the mini-batches in the
optimization is empirically set as 500. The number of SGD steps in optimizing $\Theta$ and $w$ is set as 100. We find that
satisfactory classification accuracy can be achieved within a few hundred iterations. The bandwidth of the RBF kernel
in the classification experiments and the kernel approximation experiments is set to be $\gamma = 2/\sigma^2$, where $\sigma$ is the
average distance to the 50th nearest neighbor estimated from 1,000 samples of the dataset. Further fine tuning of $\gamma$ may
lead to even better performance.
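
The bandwidth heuristic can be sketched as follows. This is a minimal NumPy illustration, assuming that the 50th nearest neighbor of each sampled point is searched within the same 1,000-point sample (the text does not state this); the function name `estimate_gamma` and its parameters are hypothetical.

```python
import numpy as np

def estimate_gamma(X, num_samples=1000, neighbor=50, seed=0):
    """Return gamma = 2 / sigma^2, sigma = mean distance to the neighbor-th nearest neighbor."""
    rng = np.random.default_rng(seed)
    m = min(num_samples, X.shape[0])          # requires m > neighbor
    S = X[rng.choice(X.shape[0], size=m, replace=False)]
    sq = np.sum(S ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * S @ S.T, 0.0)
    dists = np.sqrt(d2)                        # (m, m) pairwise Euclidean distances
    # After sorting each row, index 0 is the point itself, so index `neighbor`
    # is the distance to its neighbor-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, neighbor]
    sigma = kth.mean()
    return 2.0 / sigma ** 2
```

Because $\sigma$ scales linearly with the data, doubling every feature divides the estimated $\gamma$ by four.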

7.1 CNM for Classification 

Figure 1 shows the classification accuracies. CNM-classification is the proposed method. We compare it with three
baselines: linear SVM based on the original features (Linear), kernel SVM based on RBF (RBF), and the Random
Fourier Feature method (RFFM). As shown in the figures, none of the datasets is linearly separable, as the RBF SVM
performance is much better than the linear SVM performance.

• For all the datasets, CNM is much more compact than the Random Fourier Feature to achieve the same
classification accuracy. For example, on the USPS dataset, to get 90% accuracy, the dimensionality of CNM is 8,
compared to 512 for RFFM, a 64x improvement.

• As the dimensionality k grows, accuracies of both the RFFM and CNM improve, with the RFFM approaching 
the RBF performance. In a few cases, the CNM performance can be even higher than the RBF performance. 
This is due to the fact that CNM is “approximating” an optimal kernel, which could be better than the fixed RBF 
kernel. 
Figure 2: Compact Nonlinear Map (CNM) for kernel approximation. RFFM: Random Fourier Feature based on RBF
kernel. CNM-kerapp: CNM for kernel approximation (Section 5).


7.2 CNM for Kernel Approximation 

We conduct experiments on using the CNM framework to approximate a known kernel (Section 5). The kernel
approximation performance (measured by MSE) is shown in Figure 2. CNM is computed with dimensionality up
to 128. For all the datasets, CNM achieves more compact kernel approximations compared to the Random Fourier
Features. We further use such features in the classification task. The performance is shown as the green curve
(CNM-kerapp) in Figure 1. Although CNM-kerapp has lower MSE in kernel approximation than RFFM, its accuracy is only
comparable to or marginally better than that of RFFM. This supports the observation that better kernel approximation
may not necessarily lead to better classification.


8 An Extension: Circulant Nonlinear Maps 

Kernel approximation with nonlinear maps comes with the advantage that an SVM can be trained in $O(N)$ time and
evaluated in $O(k)$ time, leading to scalable learning and inference. In this paper, we have presented CNM, where
the projection matrix of the Random Fourier Features is optimized to achieve high classification performance. For
$d$-dimensional inputs and $k$-dimensional nonlinear maps, the computational and space complexities of both CNM and
RFFM are $O(kd)$. CNM comes with the advantage that $k$ can be much smaller than that for RFFM to achieve a similar
performance. One observation from Section 7 is that though CNM can lead to much more compact maps, it still has
better performance when higher-dimensional maps are used. In many situations, it is required that the number of
nonlinear maps $k$ is comparable to the feature dimension $d$. This leads to both space and computational
complexity of $O(d^2)$, which is not suitable for high-dimensional datasets. One natural question to ask is whether it is
possible to further improve the scalability in terms of the input dimension $d$.

Structured matrices have been used in the past to simulate a fully randomized matrix in many machine learning
settings, including dimensionality reduction [44, 22, 2], binary embedding [49], and deep neural networks. In
addition, fast Johnson-Lindenstrauss type transformations can also be used to speed up the Random Fourier
Features in kernel approximation [32], and locality sensitive hashing [14]. This comes with the advantage that linear
projection with a suitably designed structured matrix can be more space and time efficient. In this section, we
show that by imposing the circulant structure on the projection matrix, one can achieve similar kernel approximation
performance compared to the fully randomized matrix. The proposed approach reduces the computational complexity
to $O(k \log d)$, and the space complexity to $O(k)$, when $k > d$.


8.1 Circulant Nonlinear Maps 


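A circulant matrix $R \in \mathbb{R}^{d \times d}$ is a matrix defined by a vector $r = (r_0, r_1, \ldots, r_{d-1})^T$:

$R = \mathrm{circ}(r) :=
\begin{bmatrix}
r_0     & r_{d-1} & \cdots  & r_2     & r_1     \\
r_1     & r_0     & r_{d-1} &         & r_2     \\
\vdots  & r_1     & r_0     & \ddots  & \vdots  \\
r_{d-2} &         & \ddots  & \ddots  & r_{d-1} \\
r_{d-1} & r_{d-2} & \cdots  & r_1     & r_0
\end{bmatrix}.$ (12)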

Let $D$ be a diagonal matrix with each diagonal entry being a Bernoulli variable ($\pm 1$ with probability 1/2). For
$x \in \mathbb{R}^d$, its $d$-dimensional circulant nonlinear map is defined as:

$Z(x) = \cos(RDx), \quad R = \mathrm{circ}(r).$ (13)

The diagonal matrix $D$ is required in order to improve the capacity when using a circulant matrix for both binary
embedding [49] and dimensionality reduction. Since multiplication with a Bernoulli random diagonal matrix
corresponds to random sign flipping of each element of vector $x$, this can be done as a pre-processing step. To
simplify the notation, we omit this matrix in the following discussion.

A circulant matrix has the space complexity of $O(d)$. The other advantage of using the circulant projection is that
the Fast Fourier Transform (FFT) can be used to speed up the computation. Denote $\circledast$ as the operator of circulant
convolution. Based on the definition of a circulant matrix,

$Rx = r \circledast x.$ (14)

The convolution above can be computed more efficiently in the Fourier domain, using the Discrete Fourier Transform
(DFT), for which a fast algorithm (FFT) is available:

$Z(x) = \cos\left(\mathcal{F}^{-1}(\mathcal{F}(r) \circ \mathcal{F}(x))\right),$ (15)

where $\circ$ denotes the element-wise product, $\mathcal{F}(\cdot)$ is the operator of DFT, and $\mathcal{F}^{-1}(\cdot)$ is the
operator of inverse DFT (IDFT). As DFT and IDFT can be efficiently computed in $O(d \log d)$ time with FFT [36], the
proposed approach has time complexity $O(d \log d)$. Note that the circulant matrix is never explicitly computed or
stored. The circulant projections are always performed by using FFT.

What we described above assumed a circulant nonlinear map with $k = d$. When $k < d$, we can still use the
circulant matrix $R \in \mathbb{R}^{d \times d}$ with $d$ parameters, but the output is set to be the first $k$ elements in
Equation (13). When $k > d$, we use multiple circulant projections, and concatenate their outputs. This gives the
computational complexity $O(k \log d)$, and space complexity $O(k)$. Note that the DFT of the feature vector can be
reused in this case.
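
The FFT-based projection in Equations (13) to (15), including the $k < d$ and $k > d$ cases, can be sketched in NumPy as below. The function names `circ` and `circulant_map` and the sizes are hypothetical; the explicit matrix built by `circ` is used only to check the FFT result and would not be formed in practice.

```python
import numpy as np

def circ(r):
    """Explicit d x d circulant matrix with first column r (used here only as a check)."""
    return np.stack([np.roll(r, j) for j in range(len(r))], axis=1)

def circulant_map(x, rs, signs, k):
    """Compute cos(R D x) with FFTs, without ever forming R.

    rs    : (blocks, d) array, one defining vector r per circulant block
    signs : (d,) array of +/-1 entries, the diagonal of D
    k     : number of output features, k <= blocks * d
    """
    fx = np.fft.fft(signs * x)                          # DFT of Dx, shared by all blocks
    proj = np.fft.ifft(np.fft.fft(rs, axis=1) * fx, axis=1).real
    return np.cos(proj.reshape(-1)[:k])                 # concatenate blocks, keep first k

rng = np.random.default_rng(0)
d, k = 8, 12                                            # k > d, so two blocks are needed
rs = rng.normal(size=(2, d))
signs = rng.choice([-1.0, 1.0], size=d)
x = rng.normal(size=d)

z_fft = circulant_map(x, rs, signs, k)
z_dense = np.cos(np.concatenate([circ(r) @ (signs * x) for r in rs])[:k])
```

Here `z_fft` and `z_dense` agree up to floating-point error, while the FFT path stores only the vectors in `rs` rather than any $d \times d$ matrix.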
[Figure 3 panels: (a) USPS, (b) CIFAR, (c) MNIST]

Figure 3: MSE of Random Fourier Feature, and randomized circulant nonlinear map.


8.2 Randomized Circulant Nonlinear Maps 

Similar to the Random Fourier Features, one can generate the parameters of the circulant projection (i.e., the
elements of vector $r$ in Equation (13)) via random sampling from a Gaussian distribution. We term such a method
randomized circulant nonlinear maps. Figure 3 shows the kernel approximation MSE of the randomized circulant
nonlinear maps and compares it with the Random Fourier Features. Despite its much better computational and
space complexity, the circulant nonlinear map achieves almost identical MSE compared to the
Random Fourier Features.
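
A small synthetic comparison of the two constructions can be sketched as follows. This is only an illustration on random data, not a reproduction of Figure 3: it assumes the Gaussian kernel $\exp(-\|z\|^2 / (2\sigma^2))$, adds random biases as in Equation (3), and uses hypothetical sizes and variable names.

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """Exact kernel matrix for the assumed kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def features(P, b):
    """sqrt(2/k) * cos(projection + bias) for an (n, k) matrix of projections P."""
    return np.sqrt(2.0 / P.shape[1]) * np.cos(P + b)

rng = np.random.default_rng(0)
n, d, blocks, sigma = 100, 64, 4, 1.0
k = blocks * d
X = rng.normal(size=(n, d)) / np.sqrt(d)        # synthetic data with roughly unit norm
K = gaussian_kernel(X, sigma)
b = rng.uniform(0.0, 2.0 * np.pi, size=k)

# Random Fourier Features: dense Gaussian projection, O(kd) per point.
P_rff = X @ rng.normal(scale=1.0 / sigma, size=(d, k))

# Randomized circulant maps: one Gaussian vector r per block, shared sign flips D.
signs = rng.choice([-1.0, 1.0], size=d)
FX = np.fft.fft(X * signs, axis=1)              # DFT of each Dx, reused by every block
rs = rng.normal(scale=1.0 / sigma, size=(blocks, d))
P_circ = np.concatenate(
    [np.fft.ifft(np.fft.fft(r) * FX, axis=1).real for r in rs], axis=1)

Z_rff, Z_circ = features(P_rff, b), features(P_circ, b)
mse_rff = np.mean((Z_rff @ Z_rff.T - K) ** 2)
mse_circ = np.mean((Z_circ @ Z_circ.T - K) ** 2)
```

The dense projection stores $kd$ numbers, whereas the circulant variant stores only the $k$ entries of `rs` plus the $d$ signs.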


8.3 Optimized Circulant Nonlinear Maps 

Following the CNM framework, one can optimize the parameters in the projection matrix to improve the
performance using the alternating minimization procedure with the classification objective. The step to optimize the
classifier parameters $w$ is the same as described in Section 4.2. The parameters of the projection are now given by the
circulant matrix $R$. Thus, the step of optimizing $R$ requires computing the gradient with respect to each element of
vector $r$ as:

$\frac{\partial\, w^T \cos(Rx)}{\partial r_i} = -w^T \left( \sin(Rx) \circ s_{\to i}(x) \right) = -s_{\to i}(x)^T (w \circ \sin(Rx)),$ (16)

where $s_{\to i}: \mathbb{R}^d \to \mathbb{R}^d$ circularly (downwards) shifts the vector $x$ by $i$ elements. Therefore,

$\nabla_r (w^T \cos(Rx)) = -[s_{\to 0}(x), s_{\to 1}(x), \ldots, s_{\to (d-1)}(x)]^T (w \circ \sin(Rx))$ (17)
$= -\mathrm{circ}(s_{\to 1}(\mathrm{rev}(x)))\, (w \circ \sin(Rx))$
$= -s_{\to 1}(\mathrm{rev}(x)) \circledast (w \circ \sin(r \circledast x)),$

where $\mathrm{rev}(x) = (x_{d-1}, x_{d-2}, \ldots, x_0)$ and $s_{\to 1}(\mathrm{rev}(x)) = (x_0, x_{d-1}, x_{d-2}, \ldots, x_1)$.

The above uses the same trick of converting the circulant matrix multiplication to circulant convolution. Therefore,
computing the gradient of $r$ takes only $O(d \log d)$ time. The classification accuracy on three datasets with relatively
large feature dimensions is shown in Table 2. The randomized circulant nonlinear maps give similar performance to
that of the Random Fourier Features but with much less storage and computation time. Optimization of circulant
matrices tends to further improve the performance.
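
The gradient in Equation (17) can be sketched with FFT-based circular convolutions as below. This is a minimal NumPy illustration; the function names `cconv` and `grad_r` and the random test vectors are hypothetical, and the sign-flip matrix $D$ is omitted as in the text.

```python
import numpy as np

def cconv(a, b):
    """Circular convolution of two length-d vectors, computed with FFTs."""
    return np.fft.ifft(np.fft.fft(a) * np.fft.fft(b)).real

def grad_r(r, x, w):
    """Gradient of w^T cos(R x) with respect to r, where R = circ(r)."""
    v = w * np.sin(cconv(r, x))            # w o sin(Rx), using Rx = r (*) x
    x_flip = np.roll(x[::-1], 1)           # s_{->1}(rev(x)) = (x_0, x_{d-1}, ..., x_1)
    return -cconv(x_flip, v)

rng = np.random.default_rng(0)
d = 16
r, x, w = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
g = grad_r(r, x, w)                         # O(d log d), no d x d matrix is formed
```

For the hinge-loss objective, this quantity would be multiplied by the label and summed over the samples with non-zero loss, in the same way as Equation (9).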
Dataset (dimensionality k)    Random Fourier Feature    Circulant-random    Circulant-optimized
USPS (d)                      89.05 ± 0.65              89.40 ± 1.02        91.96 ± 0.45
USPS (2d)                     91.90 ± 0.29              91.87 ± 0.11        93.08 ± 0.96
MNIST (d)                     91.33 ± 0.05              91.01 ± 0.03        92.73 ± 0.21
MNIST (2d)                    92.95 ± 0.42              93.22 ± 0.30        94.11 ± 0.24
CIFAR (d)                     69.14 ± 0.64              65.21 ± 0.18        71.17 ± 0.68
CIFAR (2d)                    71.15 ± 0.28              68.56 ± 0.70        71.11 ± 0.46

Table 2: Classification accuracy (%) using circulant nonlinear maps. The randomized circulant nonlinear maps have
similar performance to that of the Random Fourier Features but with significantly reduced storage and computation time.
Optimization of circulant matrices tends to further improve the performance.


9 Conclusion 

We have presented Compact Nonlinear Maps (CNM), which are motivated by the recent works on kernel
approximation that allow very large-scale learning with kernels. This work shows that instead of using randomized feature
maps, learning the feature maps directly, even when restricted to the shift-invariant kernel family, can lead to substantially
more compact maps with similar or better performance. The improved performance can be attributed mostly to simultaneous
learning of the kernel approximation along with the classifier parameters. This framework can be seen as a shallow neural
network with a specific nonlinearity (cosine) and provides a bridge between two seemingly unrelated streams of work.
To make the proposed approach more scalable for high-dimensional data, we further introduced an extension, which
imposes the circulant structure on the projection matrix. This improves the computational complexity from $O(kd)$ to
$O(k \log d)$ and the space complexity from $O(kd)$ to $O(k)$, where $d$ is the input dimension, and $k$ is the output map
dimension. In the future it will be interesting to explore whether the complex data transforms captured by multiple layers of
a deep neural network can be captured by learned nonlinear maps while remaining compact with good training and
testing efficiency.

Acknowledgment. We would like to thank Weixin Li, David Simcha, Ruiqi Guo, and Krzysztof Choromanski for the 
helpful discussions. 